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Abstract 

The  RNA  world  hypothesis  views  modern  organisms  as  descendants  of  RNA  molecules.  The  earliest  RNA  molecules  must 
have  been  random  sequences,  from  which  the  first  genomes  that  coded  for  polymerase  ribozymes  emerged.  The 
quasispecies  theory  by  Eigen  predicts  the  existence  of  an  error  threshold  limiting  genomic  stability  during  such  transitions, 
but  does  not  address  the  spontaneity  of  changes.  Following  a  recent  theoretical  approach,  we  applied  the  quasispecies 
theory  combined  with  kinetic/thermodynamic  descriptions  of  RNA  replication  to  analyze  the  collective  behavior  of  RNA 
replicators  based  on  known  experimental  kinetics  data.  We  find  that,  with  increasing  fidelity  (relative  rate  of  base-extension 
for  Watson-Crick  versus  mismatched  base  pairs),  replications  without  enzymes,  with  ribozymes,  and  with  protein-based 
polymerases  are  above,  near,  and  below  a  critical  point,  respectively.  The  prebiotic  evolution  therefore  must  have  crossed 
this  critical  region.  Over  large  regions  of  the  phase  diagram,  fitness  increases  with  increasing  fidelity,  biasing  random  drifts 
in  sequence  space  toward  'crystallization.'  This  region  encloses  the  experimental  nonenzymatic  fidelity  value,  favoring 
evolutions  toward  polymerase  sequences  with  ever  higher  fidelity,  despite  error  rates  above  the  error  catastrophe 
threshold.  Our  work  shows  that  experimentally  characterized  kinetics  and  thermodynamics  of  RNA  replication  allow  us  to 
determine  the  physicochemical  conditions  required  for  the  spontaneous  crystallization  of  biological  information.  Our 
findings  also  suggest  that  among  many  potential  oligomers  capable  of  templated  replication,  RNAs  may  have  evolved  to 
form  prebiotic  genomes  due  to  the  value  of  their  nonenzymatic  fidelity. 

Citation:  Woo  H-J,  Vijaya  Satya  R,  Reifman  J  (2012)  Thermodynamic  Basis  for  the  Emergence  of  Genomes  during  Prebiotic  Evolution.  PLoS  Comput  Biol  8(5): 
el  002534.  doi:1 0.1 371  /journal.pcbi.1 002534 

Editor:  Carl  T.  Bergstrom,  University  of  Washington,  United  States  of  America 
Received  December  2,  2011;  Accepted  April  8,  2012;  Published  May  31,  2012 

This  is  an  open-access  article,  free  of  all  copyright,  and  may  be  freely  reproduced,  distributed,  transmitted,  modified,  built  upon,  or  otherwise  used  by  anyone  for 
any  lawful  purpose.  The  work  is  made  available  under  the  Creative  Commons  CC0  public  domain  dedication. 

Funding:  This  work  was  funded  by  the  Military  Operational  Medicine  Research  Program  of  the  U.S.  Army  Medical  Research  and  Material  Command,  Fort  Detrick, 
Maryland  under  the  U.S.  Army's  Network  Science  Initiative.  The  funders  had  no  role  in  study  design,  data  collection  and  analysis,  decision  to  publish,  or 
preparation  of  the  manuscript. 

Competing  Interests:  The  authors  have  declared  that  no  competing  interests  exist. 

*  E-mail:  jaques.reifman@us.army.mil 


Introduction 

All  biological  organisms  are  evolutionarily  related.  The  salient 
characteristics  of  life  (reproduction  and  selection)  must  have 
therefore  emerged  either  gradually  or  abruptly  from  inanimate 
chemical  processes  some  time  in  the  early  history  of  the  Earth. 
Our  ever-increasing  knowledge  on  the  biochemical  and  genetic 
basis  of  modern  life  forms  should  guide  the  quest  to  understand 
this  transition,  in  addition  to  the  chemistry  of  potential  building 
blocks  [1,2]  and  geochemical  considerations  [3,4],  The  lack  of 
fossil  evidence  forces  us  to  rely  on  model  building,  which  can 
often  be  tested  experimentally  in  the  laboratory  [5] .  One  of  the 
simplest  and  most  promising  is  the  RNA  world  hypothesis 
[1,6,7],  which  proposes  RNA  molecules  as  precursors  to  modern 
life  forms  consisting  of  DNAs  as  carriers  of  genomes  and  proteins 
as  molecular  machines.  Continued  progress  in  experimental 
studies  has  yielded  a  diverse  range  of  evidences  supporting  this 
hypothesis.  In  particular,  plausible  synthetic  routes  to  nucleo¬ 
tides  [2]  and  oligomers  [8]  have  been  demonstrated.  RNA 
ribozymes  capable  of  catalyzing  RNA  replications  have  been 
designed  and  synthesized  via  in  vitro  selection  [9,10].  Extensive 
studies  of  RNA  folding  landscapes  further  demonstrate  the 
capability  of  RNAs  to  function  both  as  carriers  of  genotypes  and 
phenotypes  [11,12]. 


Conceptual  difficulties  to  this  scenario  include  the  need  for  the 
existence  of  sufficiently  concentrated  and  pure  building  blocks 
(chirally  selected  nucleotides  for  RNAs)  and  the  necessity  to 
explain  subsequent  evolutions  of  multi-chemical  autocatalytic 
systems  [13]:  the  incorporations  of  proteins  and  nonreplicative 
metabolic  networks.  In  this  context,  Nowak  and  Ohtsuki  recently 
considered  a  model  describing  a  pre-evolutionary  stage  with 
nonreplicative  chemical  selection  [14],  The  undeniable  strength  of 
the  RNA  world  hypothesis,  nevertheless,  is  that  it  has  the  potential 
to  provide  an  empirically  well-tested  pathway  for  the  transition 
from  chemistry  to  biology,  irrespective  of  its  factual  historical 
relevance.  The  relative  simplicity  of  the  model  should  also  allow 
quantitative  descriptions  that  can  complement  empirical  ap¬ 
proaches. 

Our  focus  in  this  paper,  in  particular,  is  the  transition  from  the 
first  RNA  molecules  formed,  which  must  have  been  pools  of  near¬ 
random  RNA  sequences,  to  the  first  genomes  coding  for  RNA 
ribozymes.  Crucial  in  understanding  such  an  emergence  of  the 
first  RNA  genomes  is  the  error  threshold  predicted  by  the 
quasispecies  theory  [15-17].  At  this  threshold,  the  structure  of  a 
population  of  RNA  sequences  shifts  from  being  dominated  by  a 
stable  genome  (‘master  sequence’)  to  becoming  random  pools,  or 
vice  versa.  This  transition  can  also  be  described  and  understood  in 
the  context  of  more  general  population  dynamics  models  [18,19], 
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Author  Summary 

A  leading  hypothesis  for  the  origin  of  life  describes  a 
prebiotic  world  where  RNA  molecules  started  carrying 
genetic  information  for  catalyzing  their  own  replication. 
This  origin  of  biological  information  is  akin  to  the 
crystallization  of  ice  from  water,  where  'order'  emerges 
from  'disorder.'  What  does  the  science  of  such  phase 
transformations  tell  us  about  the  emergence  of  genomes? 
In  this  paper,  we  show  that  such  thermodynamic 
considerations  of  RNA  synthesis,  when  combined  with 
kinetics  and  population  dynamics,  lead  to  the  conclusion 
that  the  'crystallization'  of  genomes  from  its  basic 
elements  would  have  been  spontaneous  for  RNAs,  but 
not  necessarily  for  other  potential  building  blocks  of 
genomes  in  the  prebiotic  soup. 


for  which  many  exact  results  have  now  been  obtained  based  on 
statistical  physics  approaches  [17,20-24],  The  error  catastrophe 
transition  is  in  the  forward  direction,  and  has  thus  been  likened  to 
‘melting’  by  Eigen  [15].  The  transition  has  recently  been  observed 
in  behaviors  of  modern  RNA  viruses  exposed  to  mutagens  [25,26] : 
a  moderate  artificial  increase  in  mutation  rates  of  viruses  can  lead 
to  a  complete  extinction  of  virus  populations.  The  error  threshold 
is  roughly  proportional  to  the  inverse  of  genome  length,  which  also 
raised  the  question  of  how  genomes  long  enough  to  encode  error 
correction  could  have  evolved  under  high  error  rates  (Eigen’s 
paradox)  [15,27,28].  Notably,  Saakian  et  al.  [29]  have  recently 
applied  analytical  treatments  of  quasispecies  theory  to  consider  this 
question.  Higher  organisms  keep  error  rates  down  to  levels  that 
are  orders  of  magnitude  lower  than  achievable  by  polymerases 
only,  using  sophisticated  error  correction  mechanisms  including 
mismatch  repair  complexes.  Tannenbaum  et  al.  [30,31]  have 
studied  the  quasispecies  models  of  organisms  posessing  mismatch 
repair  genes,  finding  transitions  analogous  to  the  classic  error 
catastrophe  transition  in  repair-deficient  mutator  frequencies. 

The  prebiotic  evolution  in  the  RNA  world  is  in  the  opposite 
direction  of  the  error  catastrophe  transition,  and  may  thus  be 
referred  to  as  ‘crystallization.’  In  an  equilibrium  fluid,  whether  one 
observes  melting  or  crystallization  is  determined  by  the  changes  in 
temperature  and  pressure.  Can  we  find  analogous  conditions  for 
the  emergence  of  the  first  genomes?  Addressing  this  question 
requires  connections  to  thermodynamics  of  RNA  synthesis.  Recent 
developments  in  the  theory  of  nucleotide  strand  replication  [32 — 
34]  provide  a  promising  new  direction  to  bridge  the  gap  between 
the  basic  chemical  thermodynamics  of  RNA  synthesis  and 
molecular  evolution.  The  mean  error  rate  of  replication  increases 
as  the  reaction  condition  approaches  equilibrium,  contributing  to 
entropy  production  [32].  With  a  combination  of  this  single¬ 
molecule  thermodynamics  and  quasispecies  theory,  a  surprisingly 
complete  analogy  to  equilibrium  fluids  was  proposed  [34],  where 
volume,  pressure,  and  temperature  are  replaced  by  replication 
velocity,  thermodynamic  force,  and  inverse  fidelity,  respectively, 
with  counterparts  of  condensation,  sublimation,  critical  point,  and 
triple  point.  Based  on  the  analysis  of  a  model  replication  kinetics 
equivalent  to  the  Jukes-Cantor  model  of  DNA  evolution  [35],  it 
was  suggested  that  the  prebiotic  evolution  of  RNA  strands  may 
have  been  biased  by  a  thermodynamic  driving  force  toward 
increasingly  higher  fidelity  of  polymerase  ribozymes  below  a 
certain  threshold  [34], 

To  what  extent  these  theoretical  predictions  are  applicable  to 
the  actual  prebiotic  evolution  that  occurred  in  the  past  must 
ultimately  be  judged  based  on  quantitative  empirical  data  from 
existing  and  new  experiments.  Here,  we  extend  our  previous  work 


[34]  and  assess  the  applicability  of  this  thermodynamic  theory  of 
molecular  evolution  to  prebiotic  evolution,  using  experimental 
data  for  polymerization  kinetics  currently  available  in  the 
literature.  Our  results  based  on  these  empirical  data  provide  a 
strong  support  for  the  main  conclusion  of  the  theoiy,  that  there  is  a 
thermodynamic  driving  force  biasing  random  sequence  evolutions 
in  the  absence  of  genomes  toward  higher  fidelity  in  a  certain 
regime  of  parameter  spaces.  With  considerations  of  the  time- 
dependent  evolutionary  behavior  of  RNA  populations,  we 
furthermore  show  that  it  is  possible  to  estimate  the  time  scales 
that  would  have  been  required  for  a  random  sequence  pool  to 
crystallize  a  newly  discovered  master  sequence  under  a  given 
thermodynamic  condition.  These  results  also  shed  new  light  on 
Eigen’s  paradox.  Most  importantly,  our  approach  enlarges  the 
scope  of  both  the  quasispecies  theory-based  discussions  of  the 
stability  of  genomes  and  biochemical  approaches  to  RNA 
replication  by  introducing  the  concept  of  thermodynamic  driving 
forces  and  constraints  in  molecular  evolution. 

Results/Discussion 

RNA  replication  kinetics  and  thermodynamics 

The  thermodynamic  theory  of  molecular  evolution  [34] 
combines  the  kinetics  and  thermodynamics  of  RNA  replication 
on  a  single-molecule  level  with  population-level  features.  We  first 
consider  the  molecular  level  description  of  RNA  synthesis  (or 
elongation):  an  elementary  step  of  insertion  by  addition  of  a 
nucleotide  (Figure  1A)  consumes  a  nucleoside  triphosphate  (NTP) 
and  produces  a  pyrophosphate  (PPi).  Its  driving  force  F  is  given  by 


F  =  F0+  In 


[NTP] 

lPPiP 


(1) 


where  Fo  is  defined  such  that  F  —  0  at  equilibrium  (see  Methods).  We 
may  estimate  the  equilibrium  constant  from  A G°  =  5.3  kcal/mol  of 
DNA  phosphodiester  bond  formation  and  AG°  =  —  10.9  kcal/mol 
of  NTP— >NMP  +  PPi  (NMP:  nucleoside  monophosphate),  yielding 
A G°  =  —5.6  kcal/mol  [36,37].  This  value  likely  overestimates  the 
magnitude  of  Fo  because  it  ignores  the  unfavorable  entropy  change 
of  binding  a  free  NTP  monomer,  leading  to  Fg  5  9. 

One  may  seek  the  origin  of  the  observed  high-fidelity  of 
polymerization  reactions  [38]  in  the  relative  thermal  instability  of 
incorrectly  formed  Watson-Crick  base  pairs.  However,  the  stability 
differences  between  correctly  and  incorrectly  inserted  nucleotide 
pairs  are  small:  an  experimental  estimate  based  on  melting 
temperature  measurements  for  the  difference  in  free  energy 
between  incorrect  and  correct  pairs  yielded  AAG  =  0.33  kcal/mol 
[39],  which  we  adopted  in  this  work.  This  value  is  the  average  of 
the  relative  stabilities  of  nucleotides  G,  C,  and  T  (0.25,  0.33,  and 
0.41  kcal/mol,  respectively  [39])  with  respect  to  A  in  a  DNA  9- 
mer  duplex  terminus  against  the  template  base  T.  The  precise 
value  depends  on  the  identity  of  the  base  pairs  at  the  terminus  and 
at  the  neighboring  position  immediately  upstream:  for  DNAs, 
duplex  stabilities  including  effects  of  mismatches  can  be  reliably 
estimated  based  on  nearest  neighbor  interactions  [40].  Longer- 
ranged  interactions  presumably  play  more  important  roles  for 
RNAs,  which  form  secondary  structures  and  higher-order  folds 
[41],  affecting  AAG  values.  Frier  et  al.  [42]  provided  values  of  free 
energy  contributions  to  the  duplex  stability  from  all  16  possible 
terminal  RNA  base  pairs  and  mismatches  next  to  4  distinct  base 
pairs  upstream  (Table  4  in  Ref.  [42]).  From  these  data,  we 
calculated  AAG  =  0.82 +  0.24  kcal/mol,  comparable  to  kgT. 

In  quantitative  descriptions  encompassing  both  the  high 
kinetic  selectivity  and  this  marginal  stability  difference,  it  is 
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Figure  1.  Kinetics  and  thermodynamics  of  RNA  replication.  A:  Single-molecule  kinetics.  B:  Population  dynamics. 
doi:10.1371/journal.pcbi.1002534.g001 


important  to  fully  take  into  account  the  reversibility  of  the 
reactions  [32].  We  adopt  the  simplest  description  of  the  kinetics 
of  polymerization,  specified  by  16  forward  and  reverse  rates, 
k+(a\b)  and  k~(a\b),  respectively,  each  corresponding  to  the 
insertion  and  its  reverse  of  a  nucleotide  (a  =  A,U,G,C)  against  a 
template  base  ( b  =  A,U,G,C ;  Figure  1A).  In  reality,  these  rates 
do  depend  on  the  identity  of  base  pairs  immediately  upstream 
[40,41],  which  may  lead  to  stalling  after  incorrect  incorporations 
[28].  More  importantly,  however,  these  rates  also  depend  on 
[NTP]  and  [PPi],  We  estimated  the  forward  rates  from  the 
available  experimental  data  of  primer  extension  under  the  far- 
from-equilibrium  limiting  condition  [9,28,43-49] .  The  backward 
rates  can  then  be  related  to  the  forward  rates  via  equilibrium 
stability. 


In  general,  the  overall  elongation  reaction  of  a  single  nucleotide 
goes  through  a  transition  state,  whose  activation  energy  is 
differentially  affected  by  the  action  of  polymerases.  If  one  ignores 
the  reverse  reaction  under  the  condition  of  [PPi]  — >0,  the 
Michaelis-Menten  kinetics  applies  for  the  primer  extension.  In 
the  limit  of  small  [NTP],  we  then  have  k+  -+kpo\/ Kj,  the  latter 
representing  the  apparent  second-order  rate  constant  with  the 
substrate  dissociation  constant  Kj  and  the  turnover  rate  of  product 
formation  kpo i  [50].  Measurements  of  polymerase-catalyzed 
reactions  show  the  selectivity  reflected  in  differences  in  Kj  for 
correct  and  incorrect  base  pairs  to  be  orders  of  magnitude  larger 
than  equilibrium  stability  differences  [39].  Examples  currently 
found  in  the  literature  are  shown  in  Tables  1  and  2,  including 
those  for  activated  nonenzymatic  polymerization  (DNA  replication 


Table  1. 

Reference  base  incorporation  rates  k+(a\b)  of  NTPs  (rows)  against  template  bases  (columns). 

A.  Nonenzymatic  [28] 

A 

T 

G 

c 

ATP 

2.0  x  10-10 

6.1  x  10-9 

5.6  x  10~10 

2.1  x  10  10 

TTP 

1.8  x  10-9 

1.6  x  10-1° 

1.6  x  10~10 

7.1  x  10-“ 

GTP 

3.1  x  10-10 

3.4x  10"10 

4.9  x  10~10 

9.6  x  10_* 

CTP 

1.2x  IQ-10 

l.Ox  10  10 

3.2x10-* 

7.2x10-" 

B.  R18  ribozyme  [9] 

A 

u 

G 

c 

ATP 

3.0  x  10-10 

1.5  x  10~6 

1.4  x  10-10 

1.8x10-" 

UTP 

8.8  x  10~* 

2.9  x  10-10 

3.0x10-* 

3.6x  10-12 

GTP 

3.8  xlO-10 

1.2x  10-7 

4.1  x  10-10 

9.0  x  10_* 

CTP 

1.2x  10“10 

1.5x  lO"10 

6.8  x  10-7 

1.8x10-" 

C.  Poliovirus 

1  3Dpul  [43] 

A 

u 

G 

c 

ATP 

8.9  x  10~5 

0.65 

6.1  x  10-4 

1.1  x  10-* 

UTP 

1.2 

4.8  xlO"5 

6.1  x  10-4 

1.1  x  lO”3 

GTP 

8.9  x  10~5 

4.8  x  10~5 

6.1  x  10-4 

15 

CTP 

8.9  x  10~5 

4.8  xlO"5 

8.3 

1.1  x  10“3 

Rates  are  defined  as  the  apparent  second  order  rate  constant  kpo\/K(/  (or  the  limit  of  k+(a\b )  for  small  [NTP])  in  units  of  pM-1s-1. 

’For  poliovirus  3Dpolf  the  mismatch  rate  has  been  reported  for  only  one  combination  (GTP|U).  We  assumed  that  the  same  ratio  A:+(G|U)//:+(A|U)  applies  to  all  NTPs 
for  each  template  base.  The  value  for  (UTP|A)  is  a  harmonic  mean  of  two  data  (0.78  and  2.7). 
doi:1 0.1 371/journal. pcbi.1 002534.t001 
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Table  2. 

Reference  base  incorporation  rates  for  DNA  polymerases. 

A.  Sulfolobus  solfataricus  P2  DNAP  IV  (Dpo4)  [44] 

A 

T 

G 

c 

ATP 

9.9  x  10-6 

7.8  xlO-2 

2.7  x  10-5 

2.6  x  10-5 

TTP 

4.1  x  10-2 

1.8x  10-5 

6.0  x  10"5 

1.2x  10"5 

GTP 

6.0  x  10-6 

7.1  x  10-5 

6.1  xlO"5 

5.5  x  10-2 

CTP 

1.3  x  10-5 

8.4  x  IQ'5 

l.lxlO"1 

1.8x  10-“ 

B.  Human  pol  //  [45] 

A 

T 

G 

c 

ATP 

1.2x  10-6 

8.0  x  10-3 

6.9  x  10-7 

6.0  x  10-7 

TTP 

5.6  x  10-2 

5.0  x  10-7 

7.1  xlO"7 

3.7  x  10-7 

GTP 

9.1  x  10~7 

4.1  x  10-6 

3.3x  lO"6 

3.0  x  lO"2 

CTP 

3.0  x  10-6 

6.1  x  10-6 

6.4  x  10-2 

2.1  x  10-6 

C.  Yeast  pol  d  [46] 

A 

T 

G 

c 

ATP 

7.7  x  10-7 

1.7x  10-2 

8.8  x  10-7 

2.8  x  10-6 

TTP 

9.8  x  10-3 

9.0  x  10-? 

2.7  x  10-6 

2.8  x  10"6 

GTP 

5.2  x  10-7 

8.0  x  10-6 

5.3  x  10-7 

2.5  x  lO”2 

CTP 

6.7  x  10-7 

1.8x  10-6 

3.2  x  lO"2 

3.3  x  10“7 

D.  Sulfolobus  solfataricus  P2  pol  B1  [47] 

A 

T 

G 

c 

ATP 

4.6  x  10~4 

2.3 

7.5  x  10"6 

1.3  x  10-“ 

TTP 

7.5  x  10”1 

3.8x10-“ 

3.7  x  10-“ 

1.6  x  10"“ 

GTP 

4.2  x  10-5 

2.8x10-“ 

3.3  x  10-5 

1.3 

CTP 

1.9x10-“ 

2.1  x  10-“ 

7.0 x  lO”1 

4.5  x  10"6 

E.  Human  mitochondrial  polymerase  (pol  y)1  [48] 

A 

T 

G 

c 

ATP 

1.4  X  10-“ 

57 

1.7x  10-“ 

6.3  x  10-“ 

TTP 

39 

2.3x10-“ 

8.0  x  10-“ 

7.0  x  10-5 

GTP 

<io-“ 

0.016 

4.4  x  10-“ 

45 

CTP 

1.9x  10-“ 

l.Ox  10-“ 

47 

2.0  x  10-5 

F.  Rat  pol  fi  [49] 

A 

T 

G 

c 

ATP 

2.1  x  10~5 

0.46 

7.7  x  10-5 

2.8  x  10~5 

TTP 

0.41 

l.Ox  10-5 

2.9  x  10-“ 

9.0  x  10-6 

GTP 

8.1  x  10~6 

1.5x  10-“ 

3.2 x  lO”5 

0.13 

CTP 

l.Ox  10-“ 

2.9  x  10-“ 

1.1 

2.1  x  10-5 

Rates  are  defined  similarly  in  the  same  units  as  in  Table  1. 

^or  pol  y,  it  was  assumed  that  £:+(G|A)~  10-4. 
doi:1 0.1371/journal.pcbi.l  002534.t002 

without  enzymes)  determined  recently  by  Chen  et  al.  [28].  Table  1, 
in  particular,  shows  the  dramatic  increase  in  the  degree  of  relative 
stabilization  of  the  transition  states  for  correct  base  pairs  in 
modern  polymerases.  The  evolution  of  polymerases  has  entailed 
two  aspects:  the  facilitation  of  the  overall  elongation  rate  and  the 
amplification  of  the  preferential  attachment  of  correct  versus 
incorrect  nucleotides.  As  we  show  below,  this  latter  aspect  of 
selectivity  evolution  leads  to  a  phase  transition-like  behavior, 


profoundly  affecting  population  dynamics  of  evolving  macromol¬ 
ecules. 

To  characterize  this  dual  aspect  of  enzyme-catalyzed  polymer¬ 
ization  reactions,  we  adopt  a  ‘reduced'  description  involving  two 
key  characteristics  of  forward  rates:  die  mean  base  incorporation 
rate  k  and  the  relative  inverse  fidelity  y  (the  ratio  of  incorrect  to 
correct  insertion  rates).  Precise  definitions  of  these  quantities  in 
terms  of  kinetic  rates  emerge  from  the  mean  field  theory  (see 
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Methods): 


higher  k  and  y  values  than  for  DNAs.  We  nevertheless  expect  their 
order  of  magnitudes  to  be  similar. 


k  = 


J2k+(°\ b) 


(2a) 


_  l/T,a*b*  k+(a\b)\ 

1  3\  Efl^+(«l*)  j’ 


(2b) 


where  b*  is  the  Watson-Crick  complementary  base  of  b  and  the 
angled  brackets  denote  a  harmonic  mean  over  distribution  u(b)  of 
template  bases: 


^  ^ h  E b  «(*)/(  •  ■ ' ) 


(3) 


Mean  field  theory 

The  kinetic  rates  and  thermodynamic  conditions  (the  value  of  F) 
allow  us  to  extract,  using  simulations  in  general  (see  Methods),  the 
main  stationary  properties  of  RNA  elongation:  the  mean  elongation 
velocity  v  (the  average  number  of  nucleotide  pairs  added  per  unit 
time)  and  error  rate  f.1  (the  average  fraction  of  mismatched  nucleotide 
pairs).  They  differ  from  their  respective  microscopic  counterparts,  k 
and  y,  because  of  varying  contributions  of  the  reverse  rates  as 
fimctions  of  F .  Importantly,  exact  analytic  expressions  for  the 
stationary  properties  can  be  obtained  if  the  kinetic  rates  have 
sufficient  symmetry:  the  set  of  k+(ci\b)  for  all  a  is  independent  of  the 
identity  of  b  (‘symmetric  template  models;’  see  Methods  and  Figure  3). 

In  Ref.  [34],  an  important  special  case  of  symmetric  template 
models 


Figure  2  shows  the  distribution  of  these  quantities  among  nine 
polymerase  systems  whose  polymerization  kinetics  have  been 
determined  experimentally  (Tables  1  and  2),  in  which  we  observe 
qualitative  trends  of  the  evolutionary  changes  reflected  on  the 
values  of  k  and  y :  the  k  values  of  modern  polymerases  are  ~  1 07 
times  larger  than  the  activated  nonenzymatic  rate,  while  the 
nonenzymatic  fidelity  (y~10~3)  implies  that  the  Watson-Crick 
structure  in  the  absence  of  enzymes  already  supports  a  fairly  high 
level  of  fidelity.  The  arrows  illustrate  the  direction  of  evolutionary 
changes  that  must  have  occurred  from  the  nonenzymatic  to 
protein-based  polymerases  via  the  polymerase  ribozymes  in  the 
RNA  world. 

The  nonenzymatic  data  are  for  the  templated  oligomerization  of 
activated  nucleotide  analogs,  the  nucleoside  5  -phosphorimidazo- 
lide,  where  PPi  is  replaced  by  the  imidazole  group  [28],  Zielinski  et 
al.  have  compared  the  kinetics  of  RNA  versus  DNA  elongation  of 
the  activated  system  [51].  They  concluded  that  RNA  elongation  is 
more  efficient  because  its  A-form  helical  structure  positions  the  3  - 
OH  group  towards  the  incoming  monomer,  whereas  contributions 
of  wobble-pairing  appeared  to  facilitate  mismatches.  This  study 
suggests  that  the  nonenzymatic  kinetic  rates  for  RNAs  may  have 


Figure  2.  Variations  of  inverse  fidelity  y  and  mean  base 
incorporation  rate  k  among  polymerases.  See  Tables  1  and  2  for 

the  references.  Arrows  show  the  likely  direction  of  evolutionary 
changes. 

doi:  1 0.1 371/journal.pcbi.l  002534.g002 


k+(a\b)=k'  +(1  -‘W*)/] ,  (4) 

equivalent  to  the  Jukes-Cantor  model  of  DNA  evolution  [35],  was 
considered.  The  Jukes-Cantor  model  is  a  two-parameter  model, 
while  general  symmetric  template  models  have  four  parameters. 

However,  to  quantitatively  assess  the  applicability  of  the  theory 
based  on  empirical  data  of  RNA  replication  kinetics,  it  is  necessary 
to  allow  all  16  values  of  k+(a\b)  to  be  independent  empirical 
parameters.  Here,  we  used  a  version  of  the  mean  field  theory  that 
generalizes  the  analytic  results  with  the  following  expressions  for 
the  elongation  velocity  v  and  error  rate  /.i  (see  Methods): 


,  k+(a\b) 

qm=m+k-{a\by 

(5a) 

'Y^q{a\b)=\, 

a 

(5b) 

•*± 

II 

2 

(3- 

(5c) 

F  =  (  E  <l(a\b))  , 

(5d) 

V»*  U 

where  q(a\b)  denotes  the  probability  to  find  a  base-paired  to  b ,  and 
Eq.  (5b),  the  normalization  condition  for  q(a\b),  determines  v(Z>), 
the  mean  velocity  of  nucleotide  addition  against  template  base  b. 
Equation  (5a)  is  a  generalization  of  the  equilibrium  Boltzmann 
distribution,  to  which  it  reduces  to  when  v(5)  =  0,  and  is  exact  for 
symmetric  template  models  (Figure  3).  Because  the  complete 
reproduction  of  an  RNA  strand  requires  a  pair  of  replications,  we 
also  considered  the  net  error  rate  f.i2  of  two  consecutive  replications: 

EE  q(a\b)q(b\c)\  .  (6) 

a*c  b  I c 

For  connections  to  thermodynamics,  one  must  calculate  the 
entropy  production  (in  units  of  kg)  per  monomer  addition  [32]: 
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Figure  3.  Numerical  tests  of  mean  field  theory.  A:  Three  components  of  <j(«|A)  as  functions  of  g  for  a  symmetric  template  model  for  which  the 
mean  field  theory  is  exact.  The  rates  were  given  by  k+(a| A)  =  (0.03, 1.0,0.05,0. 04),  Ar+ («|U)  =  ( 1.0,0.05, 0.04,0.03),  k+(a|G)  =  (0.05,0.04,0.03,1.0),  and 
k+(a[C)  =  (0. 04,0.03, 1.0, 0.05)  for  a  =  A,  U,  G,  C.  Lines  are  from  Eq.  (40).  Symbols  are  from  numerical  simulations.  B:  Test  of  site-independence  for  the 
sequence  distribution,  Eq.  (35),  with  pol  /J rates  (Table  2F).  All  symbols  were  calculated  from  numerical  simulations.  C-D:  Mean  velocity  (C)  and  error 
rate  (D)  for  the  pol  ji  kinetics,  both  with  full  experimental  kinetics  (Table  2)  and  Jukes-Cantor  version  (JC)  derived  from  the  full  kinetic  set.  Symbols  are 
from  simulations,  which  verify  that  for  JC  kinetics  the  mean  field  prediction  is  exact, 
doi:  1 0.1 371/journal.pcbi.l  002534.g003 


u(b)^q(a\b)[g(a\b)-  \nq(a\b)\,  (7) 

b  a 

where  the  first  term  in  the  square  brackets  represents  the 
contribution  of  monomer  consumptions  to  dissipation  and  the 
second  term  corresponds  to  the  disorder  creation  by  copying  errors. 
The  average  in  Eq.  (7)  is  an  arithmetic  mean  since  F  is  not  a  rate 
and  should  match  the  external  thermodynamic  force  given  by  Eq. 
(1)  in  stationary  states.  The  quantity  —  g(a\b)  is  the  free  energy 
change  in  units  of  kgT  (or  the  negative  of  entropy  production)  for 
the  addition  of  a  nucleotide  a  against  b,  with  which  the  backward 
rates  k~  are  expressed  in  terms  of  forward  rates  k+  via 


k-(a\b)  =  k+(a\b)e-gMh).  (8) 

The  dependence  of  g{a\b)  on  concentrations  of  monomers  is 
nontrivial  because  four  NTPs  compete  for  a  single  site.  Relative 
stabilities  of  correct  versus  mismatched  base  pairs  (AA G),  in 
contrast,  are  expected  to  be  largely  insensitive  to  concentrations. 
The  following  form  of  g(a\b)  reflects  this  expectation  [34]: 

g(a\b)=g-(\-daib*)lnri,  (9) 
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where  >/  =  eAAG/*B:r ,  and  the  parameter  g  accounts  for  the 
dependence  of  g(a\b)  on  concentrations  (with  mole  fractions  of 
NTPs  assumed  to  be  maintained  equal  during  variations  of 
[NTP]/[PPi]).  With  AAG  =  0.33  kcal/mol,  we  have  <7  =  1.73  at 
T  =  300  K.  Further  physical  insights  into  the  free  energy 
parameter  g  can  be  gained  by  considering  the  condition  of 
equilibrium  (see  Methods). 

It  can  be  shown  that  /(  ranges  from  a  minimum  pimin  =  3y  far 
from  equilibrium  ( F —kx ),  v—>k),  leading  to  Eq.  (2b),  to  a 
maximum  fimax  =  3/(3  +  r/)  at  equilibrium  (F'  =  v  =  0)  (see  Meth¬ 
ods).  This  variation  of  7 1  with  varying  thermodynamic  force  F  can 
be  interpreted  as  follows:  near  equilibrium,  both  the  correct  (faster) 
and  incorrect  (slower)  incorporation  steps  are  balanced  by  their 
reverse  steps,  leading  to  comparable  net  incorporation  statistics. 
Far  from  equilibrium,  the  reverse  rates  become  negligible  and  the 
faster  correct  incorporation  dominates. 

In  Figure  3A,  we  show  that  the  mean  field  theory  is  exact  for 
arbitrary  symmetric  template  models.  Figure  3B  supports  the  site- 
independence  of  q(a\b)  [Eq.  (35)  in  Methods]  for  more  general  16- 


parameter  cases.  Comparisons  of  the  mean  field  theory  predictions 
for  elongation  properties  of  pol  fl  kinetics  (Table  2)  with 
simulations  (Figure  3C— D)  show  that  the  theory  generally  gives 
reliable  results.  The  Jukes-Cantor  reduction  of  empirical  rates  [Eq. 
(4)]  based  on  Eqs.  (2)  is  also  seen  to  give  a  good  approximation 
over  all  parameter  ranges,  showing  that  the  analytical  theory 
developed  in  Ref.  [34]  provides  accurate  descriptions  of  realistic 
kinetics.  Nevertheless,  for  the  best  numerical  accuracy  of 
predictions  based  on  experimental  kinetics,  we  based  our  main 
results  in  the  following  sections  on  stochastic  simulations. 
Importantly,  however,  the  mean  field  theory  in  the  current 
application  yields  the  definitions  given  by  Eq.  (2),  in  addition  to  the 
analytical  limits  of  velocity  and  error  rate  (see  Methods),  which  we 
verified  exactly  from  simulations. 

Single-molecule  properties 

We  applied  this  single-molecule  description  of  RNA  replication 
to  three  experimental  systems:  nonenzymatic  reactions  [28], 
‘Round- 18’  (R18)  polymerase  ribozyme  [9],  and  poliovirus 
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10 
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Figure  4.  Single-molecule  elongation  properties  as  functions  of  F.  A-B:  Mean  RNA  sequence  elongation  velocity  v  in  units  of  k  (A)  and  mean 
error  rate  (B)  with  nonenzymatic,  R18  ribozyme,  and  poliovirus  3Dpo1  kinetics,  which  show  supercritical,  near-critical,  and  subcritical  behaviors, 
respectively.  Green  arrows  indicate  discontinuous  jumps  for  poliovirus.  The  diamonds  denote  pi  values  using  the  poliovirus  sequence  (instead  of 
random  sequences  for  others),  and  the  triangles  indicate  pi2  for  poliovirus.  C-D:  Mean  elongation  velocity  (C)  and  mean  error  rate  (D)  with  increasing 
fidelity  based  on  rescaled  nonenzymatic  kinetics. 
doi:1 0.1 371 /journal. pcbi.1002534.g004 
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polymerase  (3Dpo1)  [43]  (Figure  3),  each  representing  the 
beginning,  intermediate,  and  late  stages  of  evolution.  As  has  been 
previously  observed  in  Ref.  [34]  for  the  Jukes-Cantor  model,  the 
qualitative  trend  shown  in  Figure  4  parallels  that  of  fluids 
undergoing  vapor-liquid  transitions  with  decreasing  temperature 
when  pressure,  volume,  and  temperature  are  replaced  by 
thermodynamic  force  F ,  velocity  v,  and  inverse  fidelity  y, 
respectively.  The  correspondence  of  y  to  temperature  in  fluids, 
in  particular,  is  natural  because  it  is  a  microscopic  measure  of 
randomness  destroying  genomic  information.  Figure  4A,B  shows 
that  for  high  inverse  fidelity  y  (nonenzymatic),  the  elongation 
velocity  v  and  error  rate  monotonically  increase  and  decrease, 
respectively,  with  increasing  F .  A  critical  point  is  crossed 
(ribozyme)  as  y  decreases,  and  v  and  ).i  become  nonmonotonic 
(3Dpol)  discontinuous  jumps  for  decreasing  F  (‘evaporation’). 
The  error  rate  /(2  exhibits  the  same  qualitative  behavior 
(Figure  4B).  These  results  verify  the  biological  applicability  of 
the  theoretical  predictions  made  previously  in  Ref.  [34],  based  on 
known  experimental  kinetic  data  of  systems  representing  key 
milestones  of  evolutionary  processes  (Figure  2). 


The  key  question  then  is:  how  would  these  changes  in  the 
elongation  behavior  of  RNA  replication  actually  have  occurred 
during  the  prebiotic  evolution?  To  address  this  question,  we 
modeled  the  increases  in  fidelity  from  the  nonenzymatic  starting 
point  by  uniformly  rescaling  the  incorrect  incorporation  rates 
[k  +  (a\b),  a^b*]  of  the  set  of  nonenzymatic  kinetics  (Table  1)  to 
produce  different  values  of  y.  Simulations  identified  the  critical 
point  suggested  in  Figure  4A,B  at  yc  =  7.6  x  10~4  and  verified  the 
limits  of  error  rates  at  and  far  from  equilibrium  predicted  by  the 
mean  field  theory  exactly  (Figure  4C,D). 

Phase  behavior 

We  then  scanned  the  variation  of  these  phase  behavior  for 
different  values  of  y  and  F  to  generate  the  phase  diagrams  shown 
in  Figure  5,  which  confirms  that  the  qualitative  features  of  the 
Jukes-Cantor  model  phase  diagram  [34]  are  preserved  for 
empirical  RNA  replication.  However,  as  opposed  to  the  results 
in  Ref.  [34]  that  represent  generic  predictions,  Figure  5  is  based  on 
empirical  nonenzymatic  kinetics  and  its  uniform  rescaling,  with  no 
other  adjustable  parameters. 


Figure  5.  Thermodynamic  phase  diagrams  of  RNA  replication.  A-B:  The  F-y  diagrams  with  color  levels  and  contours  (black  dashed  lines) 
representing  vg  (A)  and  ;i2  (B).  The  black  solid  lines  show  the  spinodal  terminated  by  the  critical  point  (filled  circles).  The  red  solid  lines  show  the  L-C 
transition  for  A=e  and  M  =  102,  which  meets  the  spinodal  at  the  triple  point  (open  circles).  The  white  dashed  lines  show  the  boundary  of 
\  =  dvg/dy< 0  region  (smaller  y  side).  The  green  dashed  lines  show  the  analogous  region  of  F,  values  for  starvation  processes  [Fo  =  5).  The  vertical 
lines  show  the  location  of  the  nonenzymatic  y  value.  C-D:  The  y-vg  and  yyi2  diagrams.  The  green  dashed  lines  represent  the  2<0  boundary.  The  blue 
dotted  lines  give  the  maximum  and  minimum  v  and  /,i2,  respectively,  and  the  red  dotted  line  in  D  denotes  the  maximum  error  rate  at  equilibrium.  The 
fitness  vg  is  in  units  of  ko. 
doi:  1 0.1 371/journal.pcbi.l  002534.g005 
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The  discontinuous  jumps  shown  in  Figure  4C,D  correspond  to 
the  limit  of  stability  (‘spinodal’;  thick  black  lines)  of  the  ‘liquid’  or  L 
phase  (high  v-low  /(2  state)  against  the  ‘gas’  or  G  phase  (low  v-high 
p2  state).  In  equilibrium,  the  location  of  a  phase  transition  in  the 
phase  diagram  is  determined  by  the  equality  of  free  energies  of  the 
two  phases  [52,53].  Here,  we  adopted  the  assumption  that  if 
multiple  stationary  states  exist  for  a  given  F,  the  state  with  higher  v 
(and  lower  /(2)  is  chosen.  This  assumption  is  based  on  the 
relationship 


S  =  Fj2vl  =  NFvg’  (10) 

1 

connecting  the  entropy  production  rate  S  to  F  and  the  velocity  V; 
of  RNA  replicator  /  present  in  the  system,  where  N  is  the  total 
number  of  replicators  and  Vg  is  the  mean  velocity  (or  ‘fitness’).  The 
analogy  to  equilibrium  phase  behavior  also  excludes  the  first-order 
character  of  liquid-solid  transitions,  which  for  the  current  case  is 
continuous.  Equation  (10)  is  a  special  case  of  a  general  relationship 
between  nonequilibrium  fluxes  and  conjugate  forces  [52].  In  this 
formulation,  a  state  with  high  vg  contributes  more  to  entropy 
production.  This  assumption  is  consistent  with  the  standard 
interpretation  of  the  replication  rate  as  a  measure  of  fitness 
[15,54],  The  multiplicity  of  stationary  states  at  a  single-molecule 
level  is  supported  by  the  recent  demonstration  of  a  real-time 
sequencing-by-polymerization  technique  [55],  where  it  was 
reported  that  polymerases  interconverted  between  two  distinct 
velocities  during  DNA  elongation  for  a  given  reaction  condition 
(Figures  3C  and  S3  of  Ref.  [55]).  A  complete  kinetic  character¬ 
ization  of  the  0-29  polymerase  used  in  this  experiment  would  allow 
us  to  make  a  more  quantitative  assessment  of  this  interpretation. 


In  a  population  of  self-replicating  RNAs  with  genotypes  labeled 
by  index  i,  the  genotype  i  catalyzes  replications  with  rates 

k+  (a\b)  =  ki  K(a\b),  (11) 

where  k ,  is  the  equivalent  of  Eq.  (2a)  for  the  genotype  i  specified  by 
a  fitness  landscape.  The  relative  rate  K(a\b)  specifies  the  rate  of 
addition  of  nucleotide  a  against  base  b ,  all  normalized  such  that 
222a  K(.a\b)1b  =  1.  In  this  model,  therefore,  all  genotypes  have  the 
same  set  of  relative  enzymatic  rates  for  nucleotide  pairs  (and  the 
same  value  of  y),  while  differing  in  their  absolute  magnitude  of 
catalysis,  k This  assumption  of  uniform  inverse  fidelity  is 
reasonable  for  populations  with  genotypes  distributed  within  a 
small  neighborhood  of  a  master  sequence  (or  a  small  random 
subspace  in  the  absence  of  a  master)  in  the  sequence  space.  The 
elongation  velocity  V;  is  given  by 

Vi  =  vkh  (12) 

where  (the  relative  velocity)  v  is  now  determined  from  Eqs.  (5c) 
with  k+(a\b)  replaced  by  K+(a\b),  and  the  replication  rate  of 
genotype  i  is 


2  M 


(13) 


because  a  pair  of  replication  events  requires  the  addition  of  2 M 
nucleotides,  where  M  is  the  length  of  genome. 

The  mutation  rate  Qy  from  genotype  j  to  i  is  given  by 


e(,=(i-/t2)M-rfi/(/t2/3)^,  (i4) 


Population  dynamics 

In  considering  the  thermodynamic  interpretation  of  the 
population  dynamics  of  RNA  sequences,  we  adopt  the  following 
physical  model  (Figure  1):  during  evolutionary  drifts  of  a  random 
population  in  sequence  space,  a  particular  sequence  that  folds  and 
catalyzes  the  replication  of  RNAs  with  the  same  sequence  (and  no 
others)  is  ‘discovered.’  (In  reality,  a  ribozyme  would  more  likely 
have  had  catalytic  activities  for  arbitrary  sequences.  The  selectivity 
toward  its  own  sequence,  instead,  would  have  arisen  from  the  need 
for  spatial  diffusion  in  order  to  act  on  other  sequences.)  This 
sequence  therefore  has  a  higher  k  value  [Eq.  (2a)]  compared  to 
others,  leading  to  the  single-peak  Eigen  landscape  [Eq.  (18) 
below].  Our  goal  in  this  and  the  following  subsections  is  to 
describe  the  growth  and  stability  of  this  master  sequence.  In  Ref. 
[34] ,  the  basic  quasispecies  theory  under  the  single-peak  landscape 
was  combined  with  the  theory  of  a  single-molecule  elongation.  We 
expanded  this  treatment  by  considering  different  scenarios  of  how 
F  and  y  may  have  been  distributed  in  RNA  populations 
(Figure  IB). 

For  the  inverse  fidelity  y,  one  may  first  assume  that  it  is  nearly 
uniform  (or  regard  it  as  an  average  over  replicators)  in  a 
population,  as  has  been  assumed  implicitly  in  Ref.  [34],  We  also 
assumed  that  only  the  RNA  strands  with  a  certain  polarity 
(analogous  to  the  positive  or  negative-sense  polarities  of  viral 
genomes  [56])  have  catalytic  activities,  such  that  a  pair  of 
replication  events  is  necessary  to  reproduce  a  polymerase 
ribozyme.  This  feature  makes  the  current  treatment  more  realistic 
for  RNA  prebiotic  evolution  compared  to  those  in  Ref.  [34] .  The 
following  derivation  of  the  thermodynamic  quasispecies  theory  in 
this  subsection  otherwise  adopts  the  approach  therein  [34], 


where  dy  is  the  Hamming  distance  (the  number  of  nucleotides  that 
are  different)  between  i  and  j.  Denoting  the  number  of  individuals 
(of  the  polarity  that  has  catalytic  activity)  with  genotype  i  as  the 
evolving  population  in  the  Eigen  model  [15,16]  without 

constraints  on  the  population  size  obeys  die  dynamical  equation, 

-  V  Qur,Hj.  (15) 

i 

At  any  time  t,  the  total  number  of  all  individuals  (population  size) 
N  is  given  by  N  =  JT  which  from  Eq.  (15)  changes  via  N  —  rN, 
where  r=  JN  r,p ,  is  the  population  growth  rate  (mean  fitness)  with 
the  frequency  of  genotype  i,  Pi  =  iij/N.  Therefore,  for  a  given 
population  characterized  by  the  set  {«,},  the  corresponding 

entropy  production  rate  is  given  by  Eq.  (10)  with 

Vg  =  2  Mr.  (16) 

Similarly,  under  an  idealized  condition  where  replication  occurs 
together  with  degradation  [15,17,57],  a  population  can  evolve 
under  a  constant  F  with  a  fixed  mean  population  size.  In  this  case, 
a  replication  event  occurs  with  the  same  rate  as  the  random 
degradation  of  a  replicator.  The  evolution  equation  becomes 

»t  =  ^  Qyi-jUj-rrii,  (17) 

j 

such  that  N  is  constant.  For  the  fitness  landscape,  we  adopted  the 
single-peak  Eigen  landscape: 


PLoS  Computational  Biology  |  www.ploscompbiol.org 


9 


May  2012  |  Volume  8  |  Issue  5  |  el  002534 


Thermodynamic  Basis  for  the  Emergence  of  Genomes 


ki  =  k o  x 


A 

1 


if  i  is  the  master  sequence, 
otherwise, 


(IB) 


where  ko  is  a  constant  with  the  unit  of  a  rate  and  A  >  1  is  the 
relative  fitness  of  the  master  sequence. 

Under  these  simplifying  approximations,  the  standard  quasis¬ 
pecies  theory  becomes  applicable  directly,  with  connections  to 
thermodynamics  made  by  [i2  =  f-h(y,F)  and  v  =  v(y,F).  These 
fundamental  relationships  linking  elongation  properties  to  ther¬ 
modynamic  and  kinetic  parameters  can  be  written  in  implicit  but 
closed  analytical  forms  [34]  for  the  Jukes-Cantor  model.  In  the 
numerical  approach  adopted  here  for  arbitrary  rates,  simulations 
are  first  performed  for  a  given  set  of  rate  constants  and  y  values  to 
obtain  averages  of  v,  / i2 ,  and  F  values  as  functions  of  g,  as 
illustrated  in  Figure  1  of  Ref.  [34],  The  implicit  parameter  g  is 
then  eliminated  to  obtain  v  and  /(2  as  functions  of  F  (Figure  3C— 
D).  For  the  region  in  which  multiple  branches  of  v  exist  for  a  given 
F,  the  branch  with  the  largest  v  (L  phase)  is  chosen. 

In  the  infinite  population  limit,  the  quasispecies  is  either 
dominated  by  the  ‘master  species’  with  the  mean  fitness 
r  =  (k0v/2M)A(l-  n2)M  ~(k0v/2M)Ae-lh-M ,  where  (l-/t2)M 
is  the  probability  of  replicating  M  sites  over  two  consecutive  cycles 
without  error  [see  Eq.  (6)],  or  by  the  ‘mutant  species’  with  fitness 
r  =  kov/2M.  From  Eqs.  (13)  and  (16),  the  mean  fitness  is  therefore 
given  by 


vg  =  ko  v  x 


Ae  ^2 M  if  yu2<Ju*, 
1  otherwise, 


(19) 


where 


tA(y,F)=—  (20) 

denotes  the  threshold  error  rate  for  which  v„  becomes  the  same  for 
the  master  mutant  species.  Equation  (19)  implies  that  a  constant- 
f.i2  contour  (red  lines  in  Figure  5)  is  the  ‘melting  line’  separating  a 
‘crystalline’  (C)  phase  from  the  L  phase  [34],  As  shown  in  Figure  5, 
the  L-C  transition  line  meets  the  L-G  line  at  the  triple  point,  below 
which  the  C  and  G  phases  meet  directly  (‘sublimation’).  Our 
results  show  that  this  fairly  complete  analogy  to  the  equilibrium 
phase  behavior  of  fluids  discovered  first  in  Ref.  [34]  is  indeed 
equally  applicable  in  more  realistic  considerations  of  RNA 
prebiotic  evolution. 

The  L-C  transition  line  lies  at  the  heart  of  the  crystallization  of 
genomes  that  may  occur  during  evolutionary  walks  [13]  in 
sequence  space.  The  presence  of  the  L  phase  distinct  from  the  G 
phase  below  the  critical  point  has  an  important  consequence  to 
such  sequence  explorations:  despite  the  absence  of  a  stable 
genome,  analogous  to  liquid  phases  with  short-range  orders,  RNAs 
in  the  L  phase  with  /t2  ~  10~2  (Figure  5D)  would  still  exhibit 
sequence  correlations  for  a  significantly  large  number  of  gener¬ 
ations.  We  may  use  the  Jukes-Cantor  relationship  between  the 
error  rate  and  the  cumulative  mean  Hamming  distance  d  from  an 
ancestral  sequence  after  l  generations  [35], 


do  so  in  /~  41  generations.  Therefore,  when  a  system  ‘evaporates’ 
into  the  G  phase,  an  ancestral  sequence  gets  lost  in  a  couple  of 
generations.  In  contrast,  conditions  in  the  L  phase,  with  error  rates 
comparable  to  those  in  the  C  phase  nearby  in  the  phase  diagram, 
would  greatly  facilitate  crystallizations  of  viable  genomes. 

In  interpreting  the  physical  distinction  between  L  and  G  phases, 
it  is  useful  again  to  compare  them  with  their  analogs  in  equilibrium 
fluids,  the  liquid  and  gas  phases  in  a  container.  The  pressure  of  a 
fluid  in  equilibrium  is  controlled  by  the  external  force  per  unit  area 
of  the  container,  which  matches  the  average  of  microscopic  forces 
per  unit  area  exerted  by  molecules  on  the  wall  interior.  At  high 
temperatures  (the  average  kinetic  energy  of  molecules),  a  given 
external  pressure  can  be  balanced  by  the  mean  force  of  a  state 
(gas),  where  density  is  low  and  molecules  rarely  interact.  The 
equilibrium  density  is  then  roughly  proportional  to  pressure  and 
inversely  proportional  to  temperature.  At  low  enough  tempera¬ 
tures,  a  given  external  pressure  can  also  be  matched  by  a  different 
phase  (liquid)  with  a  much  higher  density  held  together  by 
intermolecular  attractions.  Both  gas  and  liquid  phases  are 
characterized  by  the  lack  of  long-range  order.  The  sharp  boundary 
between  them  appears  when  temperature  goes  below  the  critical 
value  because  the  effect  of  molecular  interaction  renders  a  certain 
range  of  pressure  values  unstable. 

Analogously,  an  RNA  molecule  replicating  in  a  chemical 
reservoir  is  driven  by  the  external  thermodynamic  force  given  by 
Eq.  (1),  which  matches  the  average  entropy  production  per 
monomer  addition.  For  large  y  values,  the  replication  is  nearly 
random  and  the  second  term  of  Eq.  (7),  the  sequence  disorder 
contribution  to  the  entropy  production,  is  constant  (cr  In  4), 
making  the  dependence  of  internal  Fong  monotonic  [34] .  With  a 
sufficiently  small  y,  in  contrast,  the  sequence  disorder  nearly 
vanishes,  reducing  the  entropy  production.  This  change  is 
compensated  by  the  dominance  of  faster  correct  incorporation 
steps,  with  the  corresponding  increase  in  velocity  and  decrease  in 
error  rates.  A  given  value  of  external  F  can  be  matched  either  by  a 
state  with  low  velocity  and  high  errors  (G  phase),  or  by  one  with 
high  velocity  and  low  errors  (L  phase),  each  distinguished  by  the 
relative  importance  of  the  two  terms  in  the  square  brackets  in  Eq. 
(7).  The  sharp  boundary  between  them  appears  because,  for 
intermediate  values  of  v,  stationary  states  become  unstable  against 
fluctuations.  The  neighborhood  of  regimes  where  the  C  phase  is 
stable  is  dominated  by  the  L  phase  (Figure  5)  in  which  the  error 
rate  is  comparable  to  those  in  the  C  phase,  if  y  is  subcritical. 

For  the  population  as  a  whole,  random  drifts  in  y  due  to 
sequence  explorations  are  not  isotropic  but,  rather,  are  biased 
toward  the  direction  of  increasing  vg.  In  our  previous  work  [34],  a 
threshold  was  identified  within  the  phase  diagram  separating 
regimes  where  the  direction  of  this  bias  shifts.  We  sought  the 
analog  of  this  threshold  in  Figure  5  corresponding  to  the  evolution 
of  RNAs,  where  the  region  in  which  X  =  dvg/dy  <  0  in  the  L  phase 
(white  dotted  lines  in  Figure  5A,  B)  includes  the  nonenzymatic 
fidelity  value  and  links  it  to  the  C  phase.  Inside  the  C  phase,  X  is 
always  negative  (Figure  6A).  Once  a  population  has  y  values  to  the 
left  of  the  white  dashed  line  in  Figure  5A,  random  drifts  in 
sequence  space  would  be  biased  toward  increasingly  higher 
fidelity,  leading  to  crystallization  and  stable  genomes. 


^  In  (1  —  4<7/3M).  (21) 

A  typical  sequence  in  the  G  phase  with  /t2  ~  0.5  (Figure  5D),  for 
instance,  would  evolve  to  reach  d  =  M/2  in  just  /~  1.6  genera¬ 
tions,  on  average,  whereas  in  the  L  phase  with  ^2~0.02,  it  would 


Stochastic  evolutionary  dynamics 

We  next  relaxed  the  assumption  that  y  is  uniform  within  a 
population  (y->y;  is  the  value  for  genotype  l).  Equation  (13)  is  then 
replaced  by 
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B 


y 


Figure  6.  Dependence  of  mean  fitness  on  fidelity.  A:  Mean  fitness  as  a  function  of  y  at  constant  F.  The  slope  A  =  dvg/dy  is  negative  below  a 
threshold  y  for  each  F  (white  dashed  lines  in  Figure  5A,B  and  green  dashed  lines  in  Figure  5C,D,  respectively).  The  discontinuous  jump  for  F  =  0.4  and 
the  cusps  at  smaller  y  values  correspond  to  G-L  and  L-C  transitions,  respectively.  B:  Mean  fitness  averaged  over  starvation  processes  (Fq  =  5)  for 
different  initial  thermodynamic  force  F,  (see  Figure  10).  The  slope  A,  is  negative  below  a  threshold  y  for  each  F )  (green  dotted  lines  in  Figure  5A,B). 
Vertical  lines  represent  the  nonenzymatic  fidelity.  The  fitness  vy  is  in  units  of  k0. 
doi:10.1371/journal.pcbi.1002534.g006 


r,= 


kj 

2  M 


v( Vt,F), 


(22) 


for  which  numerical  simulations  have  to  be  used.  An  efficient 
method  to  extract  collective  population  dynamics  of  competing 
molecules  is  again  provided  by  the  Gillespie  algorithm  [58],  which 
was  first  applied  to  the  quasispecies  dynamics  by  Nowak  and 
Schuster  [59].  The  set  of  possible  reactions  corresponding  to  Eq. 
(17)  a  population  can  undergo  are  written  as 


the  infinite  population  result.  These  results  show  that  the  steady 
state  reached  in  simulations  depends  neither  on  the  initial 
conditions  (single  replicator  or  a  large  population)  nor  the 
boundary  conditions  (no  degradation  or  constant  N). 

Crystallization  kinetics 

We  used  the  constant-iV  stochastic  evolutionary  dynamics 
simulations  to  examine  the  temporal  evolution  of  quasispecies. 
The  inverse  fidelity  yi  was  assumed  to  depend  on  genotype  i  via 
the  same  form  of  single  peak  landscape  as  for  fitness: 


R,  — >  R,  +  Rj 
Qjiri 


(23a) 


R,  -  0 


r 


(23b) 


where  R ,  is  a  replicator  of  genotype  i.  The  mutation  matrix  is 
given  by 


e;i  =  (l-ft.)M^'(ft/3)rf7s  (24) 

where  =  fJ.2 (~i’i-F)  is  the  error  rate  of  reactions  catalyzed  by  the 

genotype  i. 

We  tested  this  simulation  algorithm  using  the  special  case  of  the 
exponential  growth  of  a  population  with  no  degradation  [Eq.  (23a) 
only] ,  uniform  error  rate  (fij  =  [It),  and  the  initial  condition  of 
single  master  sequence  under  Eq.  (18)  (see  Methods  and  Figure  7). 
Systems  with  replication  and  degradation  [Eqs.  (23)]  using 
uniform  [h  and  initial  population  size  of  104  were  also  simulated, 
in  which  the  total  population  size  showed  moderate  diffusional 
drifts  but  roughly  remained  the  same  over  typical  trajectories,  and 
P\  decayed  to  reach  the  steady  state  values  (Figure  8)  predicted  by 


Figure  7.  Time  dependence  of  master  sequence  frequency  p\. 

Stochastic  simulation  results  of  the  Eigen  model  (solid  lines,  averaged 
over  1000  trajectories)  are  compared  with  Eq.  (54)  (dotted  line),  where 
T2=r\/A  is  the  fitness  of  mutants,  for  M=102  and  A  =  e.  The  initial 
condition  was  ni  =  5it\. 
doi:1 0.1 371/journal.pcbi.l  002534.g007 
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Figure  8.  Stationary  frequency  of  master  sequence.  Stochastic 
simulations  results  for  the  Eigen  model  are  compared  with 
p\{tx>)  =  (qA  —  \)/(A  —  \).  The  simulations  were  under  the  condition  of 
(approximately)  constant  population  size  (N  ~  104)  using  Eqs.  (23).  With 
M  =  102  and  A  =e,  the  error  threshold  where  p\  =0  is  at  ^2  =  0.01.  Error 
bars  represent  one  standard  deviations. 
doi:10.1371/journal.pcbi.l002534.g008 


{y;  if  i  is  the  master  sequence, 
y2  otherwise, 


(25) 


where  we  took  yj  =  1.0  x  10  4  and  y2  =  4.23  x  10  3  (the  nonen- 
zymatic  value)  in  Figure  9.  The  time  scale  of  simulations  is  set  by 


using  A'o/[NTP]  =6.65  x  10_9(.iM_1  s_1  front  the  nonenzymatic 
replication  (Table  1)  in  Eqs.  (18)  and  (22).  We  assumed 
[NTP]  =  1  |iM  as  a  representative  chemical  environment,  such 
that  A:o  =  0.210yr~'. 

Figure  9  shows  two  typical  trajectories  starting  from  an  initial 
pool  of  random  sequences  of  length  M=  102,  containing  a  single 
replicator  designated  as  the  master  sequence.  This  ‘seeding’  of  the 
population  by  a  master  sequence  mimics  the  situation  where  a 
genotype  with  a  significantly  higher  fitness  is  discovered  during 
random  drifts.  The  resulting  evolution  in  Figure  9  is  analogous  to 
‘crystal  growths,’  in  which  the  frequency  of  the  master  sequence 
steadily  grows  to  reach  a  value  consistent  with  the  stability  of  the  C 
phase  (Figure  5):  a  master  sequence  with  A=e  and  yi  =  1 0  4 
spreads  and  dominates  the  population  under  thermodynamic  force 
F=  1.0.  The  corresponding  growth  under  F  =  0.5,  which  corre¬ 
sponds  to  the  vicinity  of  the  L-C  boundary  in  Figure  5,  is  much 
weaker  and  slower,  suggesting  that  the  phase  diagram  remains 
valid  for  inhomogeneous  y(-.  The  estimated  time  scales  in  Figure  9 
(based  on  the  activated  nonenzymatic  rates  and  [NTP]  ~  1  (iM) 
further  suggest  that  the  crystallization  of  a  genome  can  occur 
within  ~104yrs  under  suitable  conditions.  However,  as  in 
equilibrium  fluids,  it  will  never  occur  if  thermodynamics  precludes 
a  stable  C  phase. 

Starvation  process 

We  also  considered  an  alternative  setup  where  a  population 
growth  occurs  in  a  closed  system,  which  leads  to  an  evolutionary 
change  we  refer  to  as  the  ‘starvation  process.’  Similar  situations 
were  also  considered  in  Ref.  [60].  During  an  idealized  starvation 
process,  a  single  genotype  is  placed  inside  a  medium  containing  a 
given  amount  of  NTPs  and  PPi’s  with  the  corresponding  initial 
thermodynamic  force  F,.  The  population  growth  leads  to  the 
gradual  depletion  of  NTPs  and  accumulation  of  PPi’s,  lowering  F . 


A 


Figure  9.  Crystallization  of  a  genome.  Stochastic  simulations  were  used  with  genome  length  M=102  and  mean  base  incorporation  rate 
£  =  6.65xl0~9  s-1.  The  initial  population  (1V=104)  contained  random  sequences  and  a  single  master  sequence  with  relative  fitness  A  =  e.  The 
inverse  fidelity  was  given  by  Eq.  (25)  with  yx  =  1.0  x  10~4  and  y2  =4.227  x  10~3.  A:  Relative  frequency  of  the  master  sequence  (initially  10~4).  B: 
Average  of  the  fractional  Hamming  distance  (HD;  d/M  ~0.75  initially). 
doi:10.1371/journal.pcbi.1002534.g009 
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Figure  10.  Variation  of  mean  fitness  during  starvation 
processes.  The  mean  fitness  is  shown  as  a  function  of  fractional 
population  size  The  two  y  values  (with  Fj  =  1  and  Fo  =  5)  illustrate 
typical  behavior  below  and  above  the  critical  point.  The  C-L  and  L-G 
transitions  are  indicated  for  the  subcritical  case. 
doi:10.1371/journal.pcbi.1002534.g010 


The  error  rate  therefore  increases  over  time.  The  growth  of  the 
population  would  come  to  an  end  when  the  condition  finally 
reaches  equilibrium  (F  =  0).  The  resulting  collection  of  RNAs  in 
reality  may  then  disperse  into  fresh  media,  restarting  new  rounds 
of  starvation  processes. 

We  introduce  the  fractional  population  size  £  with  respect  to  the 
asymptotic  population  reached  in  the  limit  of  equilibrium  (see 
Methods).  A  single  process  starts  with  F  =  Ft,  where  vg  is 
maximum  and  <0  —  0,  and  may  undergo  up  to  two  transitions  (C- 
L  and  L-G)  if  y<yc  to  reach  equilibrium,  where  F -* 0,  >1,  and 

>0  (Figure  10).  In  Figure  6,  the  mean  fitness  as  a  function  of  y 
averaged  over  a  starvation  process  (B)  is  compared  with  that 
without  the  averaging  (A).  For  y  close  to  the  nonenzymatic  value 
(vertical  dotted  line),  2,  =  3<(Vg)/5y  is  negative  in  the  L  phase  for 
F(  above  a  threshold.  The  dependence  of  the  Xs  <  0  region  on  Fj 
(green  dashed  line  in  Figure  5A,B)  closely  resembles  that  of  the 
2<0  region  on  F  (white  dashed  lines  in  Figure  5A,B).  We 


Figure  11.  Sensitivity  of  fidelity  threshold  on  equilibrium 
constant.  The  dependence  on  Fq  of  the  minimum  F,  of  starvation 
processes,  for  which  zJ  =  3<(vg)/c)y<0,  are  shown. 
doi:10.1371/journal.pcbi.1002534.g01 1 


therefore  conclude  that  an  environment  that  supports  repeated 
starvation  processes  with  an  initial  F  above  this  boundary  for  a 
given  y  promotes  evolution  that  lowers  y.  It  is  worthwhile  to  note 
that  this  conclusion  was  reached  without  invoking  any  significant 
simplifying  assumptions  other  than  the  experimentally  character¬ 
ized  kinetics  of  nonenzymatic  replication  (Table  1),  thermody¬ 
namic  considerations,  and  the  quasispecies  theory,  except  the 
uncertainty  in  values  of  Fq.  We  verified  that  the  conclusion 
remains  valid  for  all  possible  Fq  (Figure  1 1). 

Evolution  of  longer  genomes 

The  conclusion  that  there  was  an  underlying  driving  force 
biasing  fidelity  increases  in  the  absence  of  genomes  is  particularly 
powerful  because  it  is  independent  of  the  physical  mechanisms 
implementing  it.  A  likely  mechanism  for  such  changes  is  the 
evolution  of  error  correction  with  the  necessary  increases  in 
genome  length  M .  The  Eigen’s  paradox  arises  because  such  an 
increase  would  lower  ^  (the  melting  line  recedes  toward  smaller  y 
in  Figure  5A,B).  The  melted  population,  however,  would  be  driven 
to  recrystallize  a  new,  longer  genome  because  k  <  0  in  the  L  phase. 
Growths  in  genome  lengths  most  likely  occurred  with  insertions, 
which  is  beyond  the  scope  of  our  treatment  that  only  considered 
base  substitution  errors.  Saakian  has  studied  the  evolutionary 
model  of  parallel  mutation-selection  scheme  with  insertion  and 
deletion  [61].  Similar  approaches  combined  with  our  findings  may 
offer  more  detailed  insights  on  how  genome  growths  may  have 
been  facilitated  by  thermodynamic  driving  forces.  In  addition,  we 
have  restricted  our  study  here  to  a  single  chemical  system  (RNAs). 
It  would  be  of  interest  to  apply  similar  approaches  to  more 
complex  systems  containing  multiple  ingredients,  including 
peptides. 

Together,  our  findings  in  Figure  5  suggest  that  the  initial 
nonenzymatic  fidelity  of  RNA  lies  within  the  threshold  favoring 
fidelity  increases.  Rather  than  being  coincidental,  this  feature  may 
explain  nature’s  choice  of  NTPs  as  the  media  for  encoding 
biological  information.  Many  possible  alternative  oligomers 
capable  of  templated  replications  have  been  proposed  as 
precursors  to  RNAs  [7],  Their  corresponding  monomers,  howev¬ 
er,  would  have  had  widely  different  fidelity  values,  and  one  system 
(NTPs)  that  happened  to  lie  within  the  1  <  0  boundary  presumably 
evolved  the  RNA  quasispecies  cloud  towards  smaller  y,  eventually 
crystallizing  the  first  genomes. 

Methods 

Thermodynamic  force 

An  elementary  elongation  reaction  can  be  written  as 


R„  +  a^-R„+i,  (26) 

where  fl  =  A,U,G,C  and  R„  is  the  RNA  primer  of  length  n.  The 
entropy  of  the  system  plus  reservoir  is  S  =  S(n,Na,Np),  where  Na 
and  Np  are  the  total  numbers  of  monomers  a  and  PPi, 
respectively.  The  entropy  production  rate  is  [34,52] 


•  es .  ^  as  •  ss  ■ 
s-^n+T,8NaNa+^r-Np 


i 

T 


{f-FP)v+J2^ava 

a 


y  Fava, 


(27) 


where  f  —T8S/8n  is  the  force  acting  on  the  growing  primer  (in 
length  units  of  e.g.,  base  pair  rise),  fip=  —T8S/8Np, 
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pa  =  —T8S/ 8Na,  v  =  h  =  Np  =  ^2a  va,va=—  Na  is  the  consump¬ 
tion  rate  of  nucleotide  a.  The  force  f  is  analogous  to  the  external 
tension  balancing  the  entropic  force  of  rubber  elasticity  [52].  In 
conditions  where  the  external  force  is  not  controlled,  it  may  be 
replaced  by  frictional  drag  on  polymerases,  which  would  depend 
on  elongation  velocity.  The  constant  Fa  is  given  by 

F°=T  '  (f-flp+Ha)  =F°a  +  lnp>i]  -F°~  ln  4+  ~F-  ^28) 


or  a  single  base)  can  be  written  as 

jtPn(an,t\b)  =  k+(a,,\bn)Pn_l(a"-1,t\b) 

+  '^2k-(d\b„+i)Pn+l(aan,t\b)-k~(an\bn)Pn(a",t\b) 

r 

a 

-J2k+{a’\b  n  +  i  )P„(an,t\b), 

/ 

a 


In  Eq.  (28),  F°  =  T^l(f—p°p  +  p°a)^F°,  and  in  the  third  equality, 
we  have  assumed  that  [fl]  =  [NTP]/4.  Equation  (1)  follows  with 
Fq  =  F°  —  ln4,  and  Eq.  (32)  becomes  S  =  Fv. 

Equilibrium  condition 

A  useful  physical  insight  to  g  defined  in  Eq.  (9)  can  be  gained  by 
considering  the  condition  of  equilibrium,  which  can  be  derived 
front  Eq.  (7)  as 


qeq{a\b)  =  ^b\  (29) 

or  with  Eq.  (8),  k+  ( a\b )  =  k~  (a\b)qeq(a\b),  the  detailed  balance.  In 
equilibrium,  on  the  other  hand,  we  can  calculate  qeq  by 
considering  a  two-level  system  with  a  ground  state  and  m  —  1  - 
fold  degenerate  excited  states  with  energy  gap  A  AG  ( m  is  the  total 
number  of  NTP  types;  m  =  4  in  the  main  text): 


qeq{a\b)  = 


<W+(  i-<Wk-AAG/^r 

!+(;«—  1  )e  AAG/Ayg 


1+<W0?-1) 

in  —l  +  i/ 


(30) 


Comparing  diis  with  Eqs.  (9)  and  (29),  we  obtain 


geq=-ln(^  +  m—iy  (31) 

which  we  also  verify  in  the  next  subsection  directly  from  the  mean 
field  theory.  We  may  interpret  in—  1  and  I]  in  Eq.  (31)  as  die 
entropic  and  energetic  factors  for  mismatches:  they  are  cosdy 
individually  (by  a  factor  of  q)  but  there  are  many  of  them  (;??>  1). 
Equation  (29)  shows  that  the  free  energy  parameter  g  is  defined 
with  a  constant  term  such  that  it  absorbs  the  partition  function  that 
normalizes  qeq{a\b)  in  equilibrium.  The  mean  field  theory 
generalizes  Eq.  (29)  into  Eqs.  (5a)  and  (8)  for  nonequilibrium 
conditions. 

Mean  field  theory  of  RNA  replication 

Previously,  we  derived  a  mean  field  theory  for  the  templated 
replication  and  showed  with  numerical  tests  that  it  was  exact  for 
the  two-parameter  Jukes-Cantor  model  rates  [34],  Here,  we 
reproduce  the  analytical  derivation  and  expand  it  to  show  that  the 
mean  field  theory  becomes  exact  for  symmetric  template  models 
(i.e.,  four-parameter  models  with  the  set  of  k+ (a\b)  for  all  a 
independent  of  the  identity  of  b),  of  which  the  Jukes-Cantor  model 
is  a  special  case.  We  also  test  this  conclusion  by  comparing  the 
analytical  results  with  numerical  simulations  (Figure  3).  The 
particular  version  of  the  mean  field  theory  we  adopted  for  general 
rates  in  this  paper  [Eq.  (5)]  can  then  be  considered  as  a 
generalization  of  these  expressions. 

Considering  the  elongation  of  RNA  depicted  in  Figure  1  A,  the 
master  equation  for  the  probability  Pn(a"\b")  to  have  a  chain  with 
length  n  and  sequence  an  =  {a\,  ■  ■  ■  ,«„}  under  a  template  b 
(assumed  infinitely  long;  may  refer  to  the  whole  template  sequence 


where  a, b=  1,2,  ••  •  ,m  {m  =  4  and  a,f>  =  {A,U,G,C}  in  the  main 
text).  We  introduce  the  reduced  distribution  [32] 

P^(a„a„-i  •  •  ■a„^i+i,t\b)=  ^  Pn(a",t\b) 

{ai},i<n-l  (33) 

=Pn(t\b)q{an  -  ■  ■  a„-i+\\b), 

where  pn(t\b)  is  the  probability  to  find  the  chain  with  length  n  at 
time  t  under  the  given  template  b,  and  q  is  the  conditional 
probability  of  having  the  indicated  sequence  for  the  given  chain  of 
length  n.  In  writing  q  as  time  independent,  we  have  implicitiy 
assumed  the  stationary  limit  where  q  represents  the  asymptotic 
monomer  distributions  near  the  terminus  of  the  growing  chain. 
Our  numerical  simulations  confirm  the  existence  of  such 
distributions  for  1=1.  Conversely,  the  chain  length  distribution 
pn(t\b)  supports  a  peak,  <«)>=  ^d„np„(t\b),  moving  with  a 
constant  velocity,  v=  "}2nndpn(t\b)/dt,  which  depends  on  the 
entire  template  sequence  b  in  general. 

A  special  case  of  particular  interest  is  the  symmetric  template 
models,  for  which  the  set  of  rates  would  be  specified  completely  by 
at  most  m  parameters  instead  of  rrr.  The  simplest  example  is  the 
two-parameter  model  [34]  given  in  Eq.  (4).  For  such  models,  we 
may  write 


pn(t\b)=p„(t).  (34) 

The  physical  interpretation  behind  Eq.  (34)  is  that  the  monomer 
addition  and  deletion  at  the  n’th  site  on  the  template  are  solely 
determined  by  the  set  of  rates  k+(an\bn),  which  are  independent 
of  the  identity  of  bn.  The  probability  for  chain  growth,  therefore, 
should  be  independent  of  template  sequences.  We  also  assume 
that 


q(a  a\b  b)  =  q(a  \ b  )q(a\b),  (35) 

which  is  expected  because  the  rates  k-{a\b)  are  all  local  in  their 
dependence  on  nucleotides.  Numerical  tests  suggest  that  Eq.  (35) 
is  generally  valid  for  arbitrary  rate  constants  (Figure  3B). 

Using  Eqs.  (34),  (35)  and  summing  both  sides  of  Eq.  (32)  over 
a,  (/  =  1 ,  •  •  •  ,n  —  1),  we  have 

Pn  q{a„  |  b„)  =  k+  (a„  \  b„)pn _  i  + 

(36) 

[J  Pn+i-J+P„-k  ( an\bn)p„]q(a„\b„ ), 

where 


/+  =  ^k+{a\b),  (37a) 

a 
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J  =  y  k  (a\b)q(ci\b),  (37b) 

a 

are  the  total  fluxes  for  chain  growth  and  shrinkage.  Any  expression 
involving  summations  over  a  of  k+(a\b)  is  independent  of  b  for 
symmetric  template  models. 

Equation  (36)  is  valid  for  any  {<?„,&„},  which  we  replace  by 
{a,b}.  We  note  that  Pn  =  1>  J2„Pn  =  0,  and  E„  nPn  =  v •  If we 
sum  both  sides  of  Eq.  (36)  over  a  using  q(a\b )  =  1,  multiply  by 
n,  and  sum  over  n,  we  get 

V  =  J+  ^  —(J+  +j)  npn  +  J~  y^  npn+ 1 

”  "  "  (38) 

=  ^[(«+  1)/+  —  n(J+  +  J  )  +  (n  —  l)J  ]p„  =  J+  —  J. 

n 


[34]).  Here,  we  used  a  different  approach,  generalizing  Eq.  (40) 
into  Eq.  (5a),  which  introduces  a  template-dependent  velocity  v(b). 
We  found  this  mean  field  theory  to  give  better  agreements  with 
simulations  for  asymmetric  rates  especially  when  combined  with 
averages  over  b  defined  as  the  harmonic  mean,  i.e.,  Eq.  (3). 
Because  harmonic  means  are  not  additive,  the  summation  must 
precede  the  average  in  Eq.  (5d)  within  the  current  approach. 
Equation  (5)  becomes  exact  for  symmetric  template  models. 

In  applying  the  mean  field  theory  expressions,  all  forward  rates 
are  first  scaled  with  k  [k+ (a\b)—>K  +  (a\b)  =  k+ (a\b)/ k],  and  Eq. 
(41)  is  solved  for  each  b  to  find  v(b).  Although  it  is  a  quartic 
equation  with  respect  to  v(b),  there  was  always  only  one  solution 
for  0<v(6)<l.  The  mean  velocity  v,  error  rate  /(,  and 
thermodynamic  force  F  then  follows  from  Eqs.  (5c),  (5d),  and 
(7).  The  dependence  of  v  and  g  on  F  are  obtained  by  treating  g  as 
a  parameter  to  be  eliminated  (Figure  3)  [62]. 


If  we  sum  both  sides  of  Eq.  (36)  with  respect  to  n, 

Q  =  k+  (a\b)  +  [J-  —  k~(a\b)  —  J+]  q(a\b). 


Limiting  behavior  of  elongation  properties 

Equilibrium  is  reached  when  v  =  0  and  F  =  0,  which  occurs 
when  v(b)  =  0  for  all  b\  from  Eqs.  (5a),  (8),  and  (9),  we  verify  Eqs. 
(29)  and  (31),  and  the  maximum  error  rate  is 


or  with  Eq.  (38), 


q(a\b)  = 


k+(a\b) 
v  +  k~  (a\b) 


(40) 


Equations  (37b),  (38),  (40),  and  Y2,a  q(a\b)  =  1  which  was  used  in 
the  derivation,  form  a  set  of  self-consistent  equations  for  q(a\b) 
[m  +  2  equations,  m  copies  of  Eq.  (40),  the  normalization,  and  Eq. 
(37b);  for  m  +  2  unknown,  q(a\b),  v,  and  /~].  Because  v  is 
independent  of  b,  this  set  of  equations  is  most  conveniently  solved 
by  imposing  the  normalization  condition  to  Eq.  (40)  to  determine 
v: 


Pm  ax 


m  —  1 
m—l+rj’ 


(43) 


which  would  be  equal  to  3/4  if  t;  =  1  and  m  =  4.  Far  from 
equilibrium,  where  F— >oo,  g  =  o o  and  k~ (a\b)  =  0,  Eqs.  (5a)  and 
(41)  give  v(7>)->  Y,ak+  (a\b)  =  vma(b)  and  v->vma x  =  k,  where  k  is 
given  by  Eq.  (2a).  The  error  rate  in  this  limit  approaches  its  lower 
bound, 


/E a*b*  k+(a\b)\ 

\  J2ak+(a\b) 


(in  —  l)y, 


(44) 


_  yr  k+(a\b ) 

~  ^v+k-(a\bY 


(41) 


which  defines  the  inverse  fidelity  parameter  y  via  Eq.  (2b).  The 
analogous  limits  of  /.h  can  a^0  he  calculated  from  Eq.  (6): 


We  note  that  Eq.  (41)  leads  to  a  unique  v  independent  of  b  because 
of  the  symmetry  of  rates  with  respect  to  b.  The  mean  error  rate  is 
given  by 


fh.,min 


k+  (a\b)k+  (b\c) 

Vmax  (b)  ax  (c) 


(45) 


b=J2  c^a\b^  (42( 

a¥=b* 

again  independent  of  b.  In  Figure  3A,  we  test  Eqs.  (40)  and  (41) 
with  an  example  set  of  symmetric  template  model  rates.  The 
comparison  with  numerical  simulations  shows  that  the  expressions 
are  exact  for  symmetric  template  models.  We  also  tested  the 
factorization  assumption,  Eq.  (35),  for  the  general  case  (pol  /? 
kinetics,  Table  2)  in  Figure  3B,  which  suggests  that  it  is  generally 
valid. 

Without  the  symmetry  of  kinetic  rates  with  respect  to  template 
base  identity,  the  main  difficulty  in  the  analytical  treatment  is  that 
Eq.  (34)  is  no  longer  valid,  and  the  velocity  v  depends  on  the  entire 
template  sequence.  There  are  a  number  of  ways  to  generalize  the 
exact  expressions,  Eqs.  (40)  and  (41),  into  cases  where  the  kinetic 
rates  do  depend  on  the  identity  of  template  bases.  One  way, 
demonstrated  in  Ref.  [34],  is  to  introduce  averages  over  template 
bases  to  Eq.  (36)  to  symmetrize  q(a\b)  and  v  over  the  distribution 
of  b.  An  average  over  b  of  the  right  hand  side  of  Eq.  (41) 
determines  the  mean  velocity  v  independent  of  b  (Eq.  (12)  of  Ref. 


and 


EE  e[geq(a\b)+geq(b\c)\ 

a^c  b 


(46) 

=  e2geq  n^2(m—l)(2ri+m—2)  = - ^(m—2+2n), 

(m—l  +  q)2 

where  in  the  second  equality  in  Eq.  (46),  the  average  over  c  was 
omitted  because  g(a\b)  is  symmetric  [Eq.  (9)]  and  the  three  terms 
in  the  square  brackets  correspond  to  a  =  b  +  c,  a  +  b  =  c,  and 
a  +  b  +  c  cases,  respectively.  For  m  =  4,  jU2„Mv  =  3/4  =  Jumax  if 
>7=1,  and  p2jnax  =  0.732  with  >7  =  1.73  (dotted  red  line  in 
Figure  5D). 


=  (m—  1) 


ggec/  ggeq  ggeq 

ggeq - 1_  ggeq - 1_  _  2 )egeQ  — — 

rj  rj  rj2 
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Stochastic  simulation  of  single-molecule  elongation 

Numerical  simulations  of  single-molecule  RNA  replication 
kinetics  were  performed  with  the  Gillespie  algorithm  [58]  applied 
to  Eq.  (32)  [32].  A  sufficiently  long  template  sequence  { bn }  was 
pre-generated  for  the  simulations  using  random  sequences  with 
equal  distributions  u(b)  —  1/4.  The  initial  condition  was  chosen  as 
n  =  0  (with  rate  constants  assigned  arbitrarily  for  n<  2),  and  only 
conditions  that  lead  to  positive  velocities  were  considered.  For  a 
given  set  of  k+(a\b)  [or  K(a\b)  =  k+(a\b)/k],  a  value  of  g  is 
chosen,  Eq.  (8)  is  used  to  generate  k~(a\b),  and  simulations  are 
run  to  obtain  v  =  n/t,  where  n  and  t  are  the  length  of  chain  grown 
and  time  elapsed,  respectively.  The  error  rates  p  and  p2,  and  the 
thermodynamic  force  F  are  obtained  by  first  calculating  q(a\b) 
over  the  chain  and  using  Eqs.  (5d)  and  (7).  This  procedure  is 
repeated  for  different  values  of  g  to  yield  v  and  p2  as  functions  of 
F.  Typically,  simulations  were  run  up  to  t=108A'_1  and 
properties  were  averaged  over  the  entire  chain  grown.  For 
poliovirus  3Dpo1,  simulations  were  also  performed  using  templates 
generated  by  repeating  the  poliovirus  sequence  [63]  (diamonds  in 
Figure  4B). 


Test  case  for  quasispecies  dynamics 

For  testing  the  Gillespie  simulation  of  RNA  population 
dynamics,  we  used  the  single-peak  Eigen  landscape  (18)  without 
back-mutation,  for  which  the  quasispecies  dynamics  (15)  can  be 
easily  integrated.  Although  more  advanced  methods  pioneered  by 
Saakian  and  coworkers  [17,20—22,24]  allow  exact  analyses  of  the 
Eigen  model,  the  following  simple  treatment  suffices  for  our 
puipose  of  testing  numerical  simulations  because  for  moderately 
large  M ,  the  effect  of  back-mutations  become  negligible.  Writing  a 
vector  n  =  [»i  n2]T ,  where  n\  and  n2  are  the  total  numbers  of 
individuals  with  the  master  sequence  and  mutants,  respectively, 
and  ignoring  back-mutations,  Eq.  (15)  can  be  written  as 


where 


R  = 


qn 

.(1  -q)n 


0  ■ 
>'2_ 


(51) 


(52) 


Stochastic  simulation  of  evolutionary  dynamics 

In  the  stochastic  form  of  the  Eigen  model  given  by  Eqs.  (23),  the 
total  rate  of  transformation  at  a  given  time  t  is 

a  Qjiri  +  rj  n,  =  2 ’.Nr,  (47) 

where  JT  Qjj  =  1  was  used.  In  a  simulation,  a  random  number 
0  <  Z\  <  1  with  a  uniform  distribution  is  drawn,  and 


and  q  =  (  1  —  Pi)M  ■  Diagonalizing  R  and  integrating,  we  get 

■V-1-n(0),  (53) 

where  2\  —qr  1  and  A2  =  r2  are  the  two  eigenvalues  of  R  and  V  is 
the  eigenvector  matrix.  We  then  obtain  the  time-dependent  master 
sequence  frequency  p\  =n\/(n\  +w2), 


n(0  =  eR,-n(0)  =  V- 


a  !,  1 

A  t=  -In  — 
a  z\ 


(48) 


P\(t)  = 


qA  —  1 

A-l-(l-q)Ae(r2-‘>r0r 


(54) 


determined  the  time  t  +  Al  of  the  next  replication/degradation 
event.  A  second  random  number  0<z2  <  1  was  drawn  next,  which 
chooses  one  (k,  k  =  1 ,  •  ■  ■  ,2 Ng)  of  2 Ng  reactions  ( Ng  replications 
and  Ng  degradations,  where  Ng  is  the  total  number  of  distinct 
genotypes  present  within  the  population)  from  Eqs.  (23)  following 
Ref.  [58]: 


k—  1  k 

a/  <  az2  < 
l=\  1= 1 


(49) 


where 


Figures  7  and  8  test  numerical  simulations  with  Eq.  (54)  and  its 
stationary  limit,  pi(co)  =  (qA  —  1  )/(A  —  1),  respectively. 

Starvation  process 

During  an  idealized  starvation  process,  a  single  genotype  is 
placed  inside  a  medium  containing  Nffi  NTPs  and  Nf®  PPi’s  with 
the  corresponding  initial  thermodynamic  force  Fj.  As  replication 
progresses,  F  decreases  via 


F  =  F0+  In 


Nm 

Np 


„  ,  N^-MN 

Fo  +  In  — tqs - 

Np  +  MN 


(55) 


f  (1  <k<Ng,  replication), 

\  rrik  (Ng  +  1  <  k <  2 Ng,  degradation). 


(50) 


assuming  rapid  mixing,  and  N  grows  via  N  =  rN.  Equation  (55) 
can  be  solved  for  N  to  give 


For  a  degradation  event,  the  number  of  replicators  is  updated  as 
ftk—Ng  —*Mk—Ng  —  1  ■  For  a  replication,  a  progeny  genotype  j  is 
produced  from  i  =  k  by  attempting  to  mutate  each  nucleotide  into 
3  different  bases  with  probability  pj 3,  followed  by  the  update 
rij-*nj+  1.  Since  the  total  number  of  possible  genotypes  4M  is 
exponentially  large  for  even  moderate  values  of  M,  exact 
enumerations  of «/  for  all  possible  genotypes  was  avoided.  Instead, 
the  simulation  proceeded  by  first  creating  from  the  initial 
distribution  a  list  of  genotypes  for  which  «,  >  0,  and  adding  newly 
encountered  genotypes  to  the  list  as  mutations  occurred. 


_„F-F0  jy(0) 
m  p 

(\+eF-po)M 


(56) 


Introducing  the  fractional  population  size  £  with  respect  to  the 
asymptotic  population  Nro  reached  in  the  limit  of  equilibrium 
(*■- 0), 


_  N  (l+e  Fo)(eF<  —eF) 
=  NZ=(eFi-l)(l+eF-Fo), 


or 
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77  _  1n  *  ^  '  /CO\ 

{\+e-Fo)e-Fi  +  ie-Fo{\-e-FiY  1  j 

Equation  (58)  with  v  =  v(F)  and  fl  =  li(F)  give  the  dependence  of 
mean  fitness  on  ii;  during  a  starvation  process  with  the  initial 
condition  F  =  Fj.  This  procedure  assumes  that  the  early  stages  of 
growth  with  finite  N  for  which  vg  deviates  from  Eq.  (19)  make 
negligible  contributions.  The  mean  fitness  averaged  over  the 
process  is 


This  integral  was  performed  using  trapezoidal  rules  to  obtain 
Figure  10. 
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